Calculating Correlation

Question:

Start Quiz:

import pandas as pd

filename = '/datasets/ud170/subway/nyc_subway_weather.csv'
subway_df = pd.read_csv(filename)

def correlation(x, y):
    '''
    Fill in this function to compute the correlation between the two
    input variables. Each input is either a NumPy array or a Pandas
    Series.
    
    correlation = average of (x in standard units) times (y in standard units)
    
    Remember to pass the argument "ddof=0" to the Pandas std() function!
    '''
    return None

entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
rain = subway_df['meanprecipi']
temp = subway_df['meantempi']

print(correlation(entries, rain))
print(correlation(entries, temp))
print(correlation(rain, temp))

print(correlation(entries, cum_entries))

Solution:
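
The page gives no solution text, so what follows is only one possible way to fill in correlation(), written directly from the formula in the docstring (the average of x in standard units times y in standard units). The sample Series a and b are made-up illustration data, not values from the subway dataset.

```python
import pandas as pd

def correlation(x, y):
    """Pearson's r: the mean of the products of x and y in standard units."""
    # Convert each variable to standard units using the uncorrected
    # standard deviation (ddof=0), as the exercise instructs.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Hypothetical example data, chosen so the answers are easy to check.
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

r_perfect = correlation(a, b)    # b is an increasing linear function of a
r_negative = correlation(a, -b)  # -b is a decreasing linear function of a
print(r_perfect)
print(r_negative)
```

Because b is an exact linear function of a, the first result should be 1 and the second -1, up to floating-point rounding.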

INSTRUCTOR NOTE:

Understanding and Interpreting Correlations

This page contains some scatterplots of variables with different values of correlation.
This page lets you use a slider to change the correlation and see how the data might look.
Pearson's r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson's r will be for those relationships.
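
To make the point about linearity concrete, here is a small sketch comparing a perfectly linear relationship with a perfectly quadratic one. The data is synthetic and the variable names are chosen only for illustration. It uses NumPy's corrcoef(), which returns a correlation matrix whose off-diagonal entry is Pearson's r.

```python
import numpy as np

# Evenly spaced x values, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 2.0   # exact linear relationship
y_quadratic = x ** 2       # exact, but non-linear, relationship

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(r_linear)
print(r_quadratic)
```

Even though y_quadratic is completely determined by x, its Pearson's r is essentially zero, because the relationship has no linear trend over this symmetric range.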

Corrected vs. Uncorrected Standard Deviation

By default, Pandas' std() function computes the standard deviation using Bessel's correction, which divides by n - 1 instead of n (this is ddof=1). Calling std(ddof=0) divides by n, so Bessel's correction is not used.
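
The difference is easy to see on a small example. The sketch below uses made-up values chosen so that the uncorrected standard deviation comes out to exactly 2. It also shows that NumPy's std() defaults to ddof=0, unlike Pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mean is 5, sum of squared deviations is 32, n is 8.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

corrected = values.std()              # Pandas default, ddof=1: sqrt(32 / 7)
uncorrected = values.std(ddof=0)      # divides by n: sqrt(32 / 8) = 2
numpy_default = np.std(values.values) # NumPy default is ddof=0

print(corrected)
print(uncorrected)
print(numpy_default)
```

The corrected value is slightly larger than the uncorrected one, and the gap shrinks as n grows.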

Previous Exercise

The exercise where you used a simple heuristic to estimate correlation was the "Pandas Series" exercise in the previous lesson, "NumPy and Pandas for 1D Data".

Pearson's r in NumPy

NumPy's corrcoef() function can be used to calculate Pearson's r, also known as the correlation coefficient.
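
A minimal usage sketch, with made-up arrays x and y: corrcoef() returns a 2x2 correlation matrix for two inputs, so Pearson's r between them is the off-diagonal entry. The last lines cross-check the result against the standard-units formula from the exercise.

```python
import numpy as np

# Hypothetical example data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

matrix = np.corrcoef(x, y)  # 2x2 matrix; the diagonal entries are 1
r = matrix[0, 1]            # correlation between x and y

# Cross-check: average of (x in standard units) times (y in standard units).
# NumPy's std() uses ddof=0 by default.
manual = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))

print(r)
print(manual)
```

Both values should agree (about 0.8 for this data), since the choice of ddof cancels out in Pearson's r as long as it is applied consistently.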